Environment Setup

First we will install any needed software packages. This may take a while the first time through. By default these code chunks are not evaluated, you will have to change eval=TRUE to force them to install.

# Install devtools if it's not already installed
if (!require(devtools)) {
  install.packages("devtools")
}
# Install from GitHub
devtools::install_github("mannau/h5")
devtools::install_github("BUStools/BUSpaRse")
devtools::install_github("satijalab/seurat-wrappers")
devtools::install_github("LTLA/SingleR")
devtools::install_github('linxihui/NNLM')
devtools::install_github('cvarrichio/Matrix.utils')

Install BioC packages

if (!require(BiocManager)) {
  install.packages("BiocManager")
  BiocManager::install(c("DropletUtils", "BSgenome.Mmusculus.UCSC.mm10",
                       "AnnotationHub", "SingleR","biomaRt","tidyverse","scales","zeallot","densityClust"))
}

Install monocle3

if (!require(monocle3)) {
  BiocManager::install(c('BiocGenerics', 'DelayedArray', 'DelayedMatrixStats',
                       'limma', 'S4Vectors', 'SingleCellExperiment',
                       'SummarizedExperiment', 'batchelor','fgsea'))
  devtools::install_github('cole-trapnell-lab/leidenbase') # clustering package dependency for monocle3
  devtools::install_github('cole-trapnell-lab/monocle3')
  devtools::install_github('scfurl/m3addon') # Monocle3 addon functions from scfurl (optionally included to illustrate high-variance gene identification)
}

Install Velocyto.R

For velocity applications via velocyto.R you will need to have boost (e.g. sudo apt-get install libboost-dev) and openmp libraries installed. Instructions for this are not provided in the tutorial. If you cannot get velocyto.R to install, please skip this section.

if (!require(velocyto.R)) {
  BiocManager::install(c("pcaMethods"))
  devtools::install_github("velocyto-team/velocyto.R")
}

Load required libraries

Once we have installed the necessary packages, we can load them into the R session to begin our analysis.

library(AnnotationHub)
library(biomaRt)
library(BSgenome.Mmusculus.UCSC.mm10)
library(tidyverse)
library(velocyto.R)
library(BUSpaRse)
library(scales)
library(zeallot)
library(DropletUtils)
library(SingleR)
library(NNLM)
library(monocle3)
theme_set(theme_bw()) # Setting a more pleasing default theme

Overview

Learning Objectives

• Understand the basic steps of single cell RNA-Sequencing analysis workflows • Develop a baseline appreciation for cellular heterogeneity both between and within cell types • Identify cell state transitions via pseudotime and RNA Velocity analysis and interpret meanings in an embedding

Data

Description

Download

The dataset we are using is 10x 10k neurons from an E18 mouse (This is a large dataset ~25Gb).

Cells for this sample are from a combined cortex, hippocampus and sub ventricular zone of an E18 mouse. - 11,843 cells detected - Sequenced on Illumina NovaSeq with approximately 30,000 reads per cell - 28bp read1 (16bp Chromium barcode and 12bp UMI), 91bp read2 (transcript), and 8bp I7 sample barcode

10x Genomics sequencing workflow overview

10x Genomics sequencing workflow overview

Experimental questions: - What types of cells do we expect to find? - What/how many cellular ‘states’ do we observe at the transcriptional level? - How well defined are different cell types * What do ‘transitioning’ cell types/states look like? - Can we identify differentially expressed and/or marker genes between cell types? - What genes change expression over the course of neuronal development?

First, we will need to download the dataset, unless it has already been downloaded for you.

# Download data
if (!file.exists("./data/neuron_10k_v3_fastqs.tar")) {
  download.file("http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/3.0.0/neuron_10k_v3/neuron_10k_v3_fastqs.tar", "./data/neuron_10k_v3_fastqs.tar", method = "wget", quiet = TRUE)
}

After downloading, we need to expand (uncompress) the files

cd ./data
tar -xvf ./neuron_10k_v3_fastqs.tar

Kallistobus workflow - single cell preprocessing

Since we are downloading the raw sequencing reads for this single cell project, we will first have to ‘preprocess’ the data. In this step, the reads will be 1) mapped to a reference transcriptome to determine from which gene they are derived, 2) ‘demultiplexed’ or assigned to specific cells based on the presence of a cell-specific barcode sequence, and 3) duplicate reads from the same cell:gene fragment will be collapsed using a ‘unique molecular identifier’ (UMI).

This notebook demonstrates the use of command line tools kallisto and bustools. Please use kallisto >= 0.46, whose binary can be downloaded here. Also, please use bustools >= 0.39.3, whose binary of bustools can be found here. User interface of bustools has changed in version 0.39.3. For version 0.39.2, see earlier git commits of this notebook.

After you download the binary, you should decompress the file (if it is tar.gz) with tar -xzvf file.tar.gz in the bash terminal, and add the directory containing the binary to PATH by export PATH=$PATH:/foo/bar, where /foo/bar is the directory of interest. Then you can directly invoke the binary on the command line as we will do in this notebook.

Get gene annotations to prepare a reference index

In order to map the read sequences to their cognate genes/transcripts, we first must download and prepare a reference transcriptome. Here we are using the mouse reference transcriptome from Ensembl version 97. We download this dataset using a standard Bioconductor ‘AnnotationHub’ workflow.

# query AnnotationHub for mouse Ensembl annotation
ah <- AnnotationHub() # Connect to the Annotation Hub to query.
## snapshotDate(): 2019-10-29
record<-query(ah,pattern = c("Ensembl", "97", "Mus musculus", "EnsDb")) # Identify the AnnotationHub record associated with Mouse Ensembl v97

Once we’ve identified the AnnotationHub record id AH73905 for Mouse Ensembl v97, we can use this to retrieve all of the annotation records for this build.

# Get mouse Ensembl 97 annotation
edb <- ah[[names(record)]] # I need to comment this out to get the material to build currently...Please uncomment if you are running yourself.
## loading from cache
## require("ensembldb")

Get and Organize Required Preprocessing Files

As part of this exercise, we will be performing an RNA Velocity analysis (detailed below). This requires us to generate two seperate count matrices, one for reads mapping to ‘spliced’ transcripts and one for reads mapping to ‘unspliced’ transcripts. To do this dual mapping, we need to extract the information that we need to separately map the reads to cDNA sequences as well as intronic sequences. The BUSpaRse library has a function get_velocity_files() that will extract and process the necessary information from an AnnotationHub object.

Note: While this approach generates both the spliced and unspliced annotation records, for most of the downstream analysis we will just be using the count matrices associated with the spliced read mappings.

get_velocity_files(edb, L = 91, Genome = BSgenome.Mmusculus.UCSC.mm10,
                   out_path = "./output/neuron10k_velocity",
                   isoform_action = "separate")

This has created all the necessary annotation files and dumped them into the directory ./output/neuron10k_velocity

Build Reference index

Once we have the files for the cDNA sequences and the intronic sequences, we need to construct a reference kallisto index for the mapping. This indexing scans through the reference transcriptome and organizes it in a way that makes mapping the read sequences faster and more efficient.

# Intron index
kallisto index -i ./output/mm_cDNA_introns_97.idx ./output/neuron10k_velocity/cDNA_introns.fa

Using the kb wrapper

With kallisto and bustools, it takes several commands to go from fastq files to the spliced and unspliced matrices, which is quite cumbersome. So a wrapper called kb was written to condense those steps to one. The command line tool kb can be installed with

conda activate kallistobus
pip install kb-python

Once you have this installed, we can then we can use the following command to generate the spliced and unspliced count matrices from the raw reads and the assembled reference index. The count matrix files will be the primary data sources for the downstream/secondary analysis.

Example \(genes X cells\) matrix of Unique Molecular Identifier (UMI) counts:

cell1 cell2 cell3
gene1 0 2 0
gene2 15 7 3
gene3 1 0 2

cd ./output/neuron10k_velocity
kb count -i ../mm_cDNA_introns_97.idx -g tr2g.tsv -x 10xv3 -o kb \
-c1 cDNA_tx_to_capture.txt -c2 introns_tx_to_capture.txt --lamanno \
../../data/neuron_10k_v3_fastqs/neuron_10k_v3_S1_L002_R1_001.fastq.gz \
../../data/neuron_10k_v3_fastqs/neuron_10k_v3_S1_L002_R2_001.fastq.gz \
../../data/neuron_10k_v3_fastqs/neuron_10k_v3_S1_L001_R1_001.fastq.gz \
../../data/neuron_10k_v3_fastqs/neuron_10k_v3_S1_L001_R2_001.fastq.gz

Preprocessing

Once the above preprocessing is complete, we now have two different matrices: one tallying the number of reads for each gene that mapped to spliced cDNA isoforms, and one for the reads matching unspliced isoforms. As mentioned above, the bulk of the downstream analysis will be performed on the ‘spliced’ matrix. For each matrix, we also have two accessory files: one corresponding to the gene names for each row, and another with the cell IDs (barcodes) for each column. All of these files can be read into R using the function read_velocity_output() from BUSpaRse:

d <- "./output/neuron10k_velocity/kb/counts_unfiltered" # top-level directory containing requisite files with 'spliced' and 'unspliced' prefixes.
c(spliced, unspliced) %<-% BUSpaRse::read_velocity_output(spliced_dir = d,
                                                spliced_name = "spliced",
                                                unspliced_dir = d,
                                                unspliced_name = "unspliced")

spliced[1:6,1:6]
## 6 x 6 sparse Matrix of class "dgCMatrix"
##                       AAACCCAAGAAACACT AAACCCAAGAAACCAT AAACCCAAGAAACTCA
## ENSMUSG00000000001.4                 .                .                .
## ENSMUSG00000000003.15                .                .                .
## ENSMUSG00000020875.9                 .                .                .
## ENSMUSG00000000028.15                .                .                .
## ENSMUSG00000048583.16                .                .                .
## ENSMUSG00000000049.11                .                .                .
##                       AAACCCAAGAAAGTCT AAACCCAAGAAATCCA AAACCCAAGAAATTGC
## ENSMUSG00000000001.4                 .                .                .
## ENSMUSG00000000003.15                .                .                .
## ENSMUSG00000020875.9                 .                .                .
## ENSMUSG00000000028.15                .                .                .
## ENSMUSG00000048583.16                .                .                .
## ENSMUSG00000000049.11                .                .                .
unspliced[1:6,1:6]
## 6 x 6 sparse Matrix of class "dgCMatrix"
##                       AAACCCAAGAAACACT AAACCCAAGAAACCCG AAACCCAAGAAACTGT
## ENSMUSG00000000001.4                 .                .                .
## ENSMUSG00000000003.15                .                .                .
## ENSMUSG00000020875.9                 .                .                .
## ENSMUSG00000000028.15                .                .                .
## ENSMUSG00000048583.16                .                .                .
## ENSMUSG00000000049.11                .                .                .
##                       AAACCCAAGAAATCCA AAACCCAAGAAATTGC AAACCCAAGAACAGGA
## ENSMUSG00000000001.4                 .                .                .
## ENSMUSG00000000003.15                .                .                .
## ENSMUSG00000020875.9                 .                .                .
## ENSMUSG00000000028.15                .                .                .
## ENSMUSG00000048583.16                .                .                .
## ENSMUSG00000000049.11                .                .                .

What fraction of UMIs are from unspliced transcripts?

sum(unspliced@x) / (sum(unspliced@x) + sum(spliced@x))
## [1] 0.4285891

We expect around 10,000 cells. There are over 10 times more barcodes here, since most barcodes are from empty droplets. The number of genes is constant and represents ‘all annotated genes’ from the Ensembl v97 mouse database.

dim(spliced)
## [1]   55487 1393951
dim(unspliced)
## [1]   55487 1091523

Here rownames are gene identifiers and column names are the unique cell barcode sequences (cell_id).

Most barcodes only have 0 or 1 UMIs detected.

tot_count <- Matrix::colSums(spliced)
summary(tot_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     1.0     1.0    57.4     1.0 47529.0

Remove empty droplets

Since most of the droplets that are generated by the 10x single cell workflow will not contain cells, we need to identify and remove those ‘empty droplets’ from each matrix. A commonly used method to estimate the number of empty droplets is barcode ranking knee and inflection points, as those are often assumed to represent transitions between two components of a distribution (in this case ‘cells’ vs ‘background’). While more sophisticated methods exist (e.g. see emptyDrops in DropletUtils), for simplicity, we will use the barcode ranking method here. we will use the spliced matrix for filtering, though both matrices have similar inflection points.

Here we calculate the rank order of the cells based on the number of UMIs for each.

bc_rank <- barcodeRanks(spliced)
bc_uns <- barcodeRanks(unspliced)

We can plot number of UMIs on the x axis, and cell barcode rank on the y axis. See this blog post by Lior Pachter for a more detailed explanation.

#' Knee plot for filtering empty droplets
#'
#' Visualizes the inflection point to filter empty droplets. This function plots
#' different datasets with a different color. Facets can be added after calling
#' this function with `facet_*` functions.
#'
#' @param bc_ranks A named list of output from `DropletUtil::barcodeRanks`.
#' @return A ggplot2 object.
#' @importFrom tibble tibble
#' @importFrom purrr map map_dbl
#' @importFrom dplyr distinct
#' @importFrom ggplot2 geom_line geom_hline geom_vline scale_x_log10 scale_y_log10
#' @importFrom tidyr unnest
#' @export
knee_plot <- function(bc_ranks) {
  # purrr pluck shorthand doesn't work on S4Vector DataFrame
  knee_plt <- tibble(rank = map(bc_ranks, ~ .x[["rank"]]),
                     total = map(bc_ranks, ~ .x[["total"]]),
                     dataset = names(bc_ranks)) %>%
    unnest(cols = c(rank, total)) %>%
    distinct() %>%
    dplyr::filter(total > 0)
  annot <- tibble(inflection = map_dbl(bc_ranks, ~ metadata(.x)[["inflection"]]),
                  rank_cutoff = map_dbl(bc_ranks,
                                        ~ max(.x$rank[.x$total >
                                                        metadata(.x)[["inflection"]]])),
                  dataset = names(bc_ranks))
  p <- ggplot(knee_plt, aes(rank, total, color = dataset)) +
    geom_line() +
    geom_hline(aes(yintercept = inflection, color = dataset),
               data = annot, linetype = 2) +
    geom_vline(aes(xintercept = rank_cutoff, color = dataset),
               data = annot, linetype = 2) +
    scale_x_log10() +
    scale_y_log10() +
    labs(x = "Rank", y = "Total UMIs")
  return(p)
}
knee_plot(list(spliced = bc_rank, unspliced = bc_uns)) +
  coord_flip()

Here we can see the knee plot (reading from right to left) and we can identify an inflection point where there is a sharp dropoff in the information content for barcodes. This is (theoretically) the transition between barcodes corresponding to droplets that actually contained cells, vs those that only had a background level of RNA. As expected, it’s roughtly around the 10,000 cell mark which is how many cells we estimated were loaded.

We will use this inflection point (1693 UMIs) to enforce a threshold minimum number of UMIs to filter the barcodes down to only those that contain cells. We can additionally remove any genes with no detectable signal across any cells. We then subset both the spliced and unspliced matrices along both of these dimensions.

bcs_use <- colnames(spliced)[tot_count > metadata(bc_rank)$inflection]
# Remove genes that aren't detected
tot_genes <- Matrix::rowSums(spliced)
genes_use <- rownames(spliced)[tot_genes > 0]
sf <- spliced[genes_use, bcs_use]
uf <- unspliced[genes_use, bcs_use]

So what are we left with?

dim(sf)
## [1] 24825 10391

24825 rows (genes) and 10391 columns (cells)

Monocle3 secondary analysis

That concludes the preprocessing and initial filtering to remove empty droplets, now we will need to import the spliced count matrix into a single cell analysis framework for downstream/secondary analysis. Here we will be using the monocle3 framework.

Create Monocle cell_data_set object from spliced matrix

The cell_data_set class defines how the single cell data are stored, indexed, manipulated, and sliced. The three components that we need to create a cell_data_set instance are 1) the sparse count matrix, 2) gene-level annotation, and 3) cell-level annotation. We don’t have much for #’s 2 or 3 at this point other than ids, but it’s enough to get started.

cellMeta<-data.frame("barcode"=colnames(sf))
rownames(cellMeta)<-cellMeta$barcode
geneMeta<-data.frame("gene_id"=rownames(sf))
rownames(geneMeta)<-geneMeta$gene_id
dat <- monocle3::new_cell_data_set(sf,
                         cell_metadata = cellMeta,
                         gene_metadata = geneMeta)
## Warning in monocle3::new_cell_data_set(sf, cell_metadata = cellMeta,
## gene_metadata = geneMeta): Warning: gene_metadata must contain a column verbatim
## named 'gene_short_name' for certain functions.

We have now created our base monocle3 cell_data_set object called dat. Lets peek around to see what’s inside:

dat
## class: cell_data_set 
## dim: 24825 10391 
## metadata(1): cds_version
## assays(1): counts
## rownames(24825): ENSMUSG00000000001.4 ENSMUSG00000020875.9 ...
##   ENSMUSG00000118400.1 ENSMUSG00000118435.1
## rowData names(1): gene_id
## colnames(10391): AAACCCACACGCGGTT AAACCCACAGCATACT ... TTTGTTGGTTCAACGT
##   TTTGTTGTCCGTTTCG
## colData names(2): barcode Size_Factor
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

This is a summary report of dat.

To access the expression matrix we use the exprs() method:

exprs(dat)[1:6,1:6]
## 6 x 6 sparse Matrix of class "dgCMatrix"
##                       AAACCCACACGCGGTT AAACCCACAGCATACT AAACCCACATACCATG
## ENSMUSG00000000001.4                 .                .                .
## ENSMUSG00000020875.9                 .                .                .
## ENSMUSG00000000028.15                .                .                .
## ENSMUSG00000048583.16                .                .                .
## ENSMUSG00000000049.11                .                .                .
## ENSMUSG00000000058.6                 .                .                .
##                       AAACCCAGTCGCACAC AAACCCAGTGCACATT AAACCCAGTGGTAATA
## ENSMUSG00000000001.4                 .                1                1
## ENSMUSG00000020875.9                 .                .                .
## ENSMUSG00000000028.15                .                .                .
## ENSMUSG00000048583.16                .                .                .
## ENSMUSG00000000049.11                .                .                .
## ENSMUSG00000000058.6                 .                .                .

To access the gene annotations we use the fData() method:

fData(dat)
## DataFrame with 24825 rows and 1 column
##                                     gene_id
##                                    <factor>
## ENSMUSG00000000001.4   ENSMUSG00000000001.4
## ENSMUSG00000020875.9   ENSMUSG00000020875.9
## ENSMUSG00000000028.15 ENSMUSG00000000028.15
## ENSMUSG00000048583.16 ENSMUSG00000048583.16
## ENSMUSG00000000049.11 ENSMUSG00000000049.11
## ...                                     ...
## ENSMUSG00000118398.1   ENSMUSG00000118398.1
## ENSMUSG00000118467.1   ENSMUSG00000118467.1
## ENSMUSG00000118421.1   ENSMUSG00000118421.1
## ENSMUSG00000118400.1   ENSMUSG00000118400.1
## ENSMUSG00000118435.1   ENSMUSG00000118435.1

To access the cell annotations we use the pData() method:

pData(dat)
## DataFrame with 10391 rows and 2 columns
##                           barcode       Size_Factor
##                          <factor>         <numeric>
## AAACCCACACGCGGTT AAACCCACACGCGGTT 0.796261808586696
## AAACCCACAGCATACT AAACCCACAGCATACT 0.426575001473047
## AAACCCACATACCATG AAACCCACATACCATG 0.502656477212058
## AAACCCAGTCGCACAC AAACCCAGTCGCACAC  1.27194394440038
## AAACCCAGTGCACATT AAACCCAGTGCACATT 0.495048329638157
## ...                           ...               ...
## TTTGTTGGTAGCTAAA TTTGTTGGTAGCTAAA   1.3618584157283
## TTTGTTGGTATCCCAA TTTGTTGGTATCCCAA 0.579775427620237
## TTTGTTGGTCCGAAAG TTTGTTGGTCCGAAAG 0.345133239943332
## TTTGTTGGTTCAACGT TTTGTTGGTTCAACGT 0.860066500740549
## TTTGTTGTCCGTTTCG TTTGTTGTCCGTTTCG 0.606749769018614

There’s not much annotation in there as of yet, lets see what we can add. ## Add gene-level annotation from BioMaRt Using the gene_id information in the featureData slot, we can fetch external annotations for each gene and merge them so we can get more meaningful gene information. biomaRt is an interface in Bioconductor to get information associated with gene_ids.

mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                         dataset = "mmusculus_gene_ensembl",
                         host = 'ensembl.org')
t2g <- biomaRt::getBM(attributes = c("ensembl_gene_id",
                                     "external_gene_name",
                                     "chromosome_name",
                                     "start_position",
                                     "end_position"
                                     ), mart = mart) # Fetch annotation information for all gene_ids
## Cache found
fData(dat)$gene_id_trimmed<-str_split_fixed(fData(dat)$gene_id,pattern="\\.",2)[,1] #Trim off the version identifier from the gene_ids
fData(dat)<-dplyr::left_join(as.data.frame(fData(dat)),t2g,by=c("gene_id_trimmed" = "ensembl_gene_id"),sort=FALSE,all.x=TRUE,keep=TRUE) # merge annotation into existing fData().
fData(dat)$gene_short_name<-fData(dat)$external_gene_name # make a field named "gene_short_name" in fData()
head(fData(dat)) #Inspect
## DataFrame with 6 rows and 7 columns
##                                     gene_id    gene_id_trimmed
##                                    <factor>        <character>
## ENSMUSG00000000001.4   ENSMUSG00000000001.4 ENSMUSG00000000001
## ENSMUSG00000020875.9   ENSMUSG00000020875.9 ENSMUSG00000020875
## ENSMUSG00000000028.15 ENSMUSG00000000028.15 ENSMUSG00000000028
## ENSMUSG00000048583.16 ENSMUSG00000048583.16 ENSMUSG00000048583
## ENSMUSG00000000049.11 ENSMUSG00000000049.11 ENSMUSG00000000049
## ENSMUSG00000000058.6   ENSMUSG00000000058.6 ENSMUSG00000000058
##                       external_gene_name chromosome_name start_position
##                              <character>     <character>      <integer>
## ENSMUSG00000000001.4               Gnai3               3      108107280
## ENSMUSG00000020875.9               Hoxb9              11       96271457
## ENSMUSG00000000028.15              Cdc45              16       18780447
## ENSMUSG00000048583.16               Igf2               7      142650766
## ENSMUSG00000000049.11               Apoh              11      108343354
## ENSMUSG00000000058.6                Cav2               6       17281185
##                       end_position gene_short_name
##                          <integer>     <character>
## ENSMUSG00000000001.4     108146146           Gnai3
## ENSMUSG00000020875.9      96276595           Hoxb9
## ENSMUSG00000000028.15     18811987           Cdc45
## ENSMUSG00000048583.16    142666816            Igf2
## ENSMUSG00000000049.11    108414396            Apoh
## ENSMUSG00000000058.6      17289115            Cav2

Now we have a bit more useful information associated with each gene.

Data QC and Filtering

Before we move on to exploration and annotation of the data, we first need to get some summary statistics such as a scaling factor and an estimate of the dispersion for each gene (variance in excess of what is expected for a given model fit).

Dispersion Estimation

The dispersion estimate is a useful metric to identify genes that have higher variance than expected. Per-gene technical variance in scRNA-Seq data is well modeled by a negative binomial fit. Genes with excess variation (overdispersion) should have high biological variation above and beyond the modeled technical variance.

dat<-m3addon::calculate_gene_dispersion(dat,id_tag="gene_id") 

# Select genes with high dispersion relative to fit
dat<-m3addon::select_genes(dat)
m3addon::plot_gene_dispersion(dat) + scale_color_manual(values=c("black","red"))

This plot shows how the variance in gene expression measures changes as a function of the mean expression level for each gene. The fit line is a negative binomial approximation that corresponds to the ‘technical variance’ expected at all levels of expression. Genes that have a high residual to this fit have more variance than expected and that may represent additional ‘biological variance’. Subsetting to these genes can help improve downstream analyses by focusing on genes that better segregate different cell types or states.

Monocle3 preprocessing

The term ‘preprocessing’ comes up again even though we’re past the initial hurdle of generating the matrix. Standard secondary preprocessing for single cell RNA-Seq involves projecting expression data into the top principal components to identify ranked sources of variation. This is usually done after log-transforming the data to stabilize the variance across the dynamic range of gene expression. Monocle3 conveniently provides a function preprocess_cds() that will do this transform and PCA analysis. We start with a relatively high number of principal components to estimate, 100.

dat <- preprocess_cds(dat,
                      num_dim = 100, 
                      method = "PCA",
                      norm_method = "log",
                      verbose=T,
                      cores=4)
## Remove noise by PCA ...
plot_pc_variance_explained(dat)

If we look at the variance explained for each principle component, we can see that it starts to trail off after a point. We probably won’t get much more useful information after 40 or so components. So we subset to only the first 40 components and preprocess again.

nDims<-40
dat <- preprocess_cds(dat,
                      num_dim = nDims,
                      method = "PCA",
                      norm_method = "log",
                      verbose=T,
                      cores=4)
## Remove noise by PCA ...

QC metrics

Next we can actually start to assess some of the important quality metrices for each gene and each cell.

Minimum number of cells expressing a given gene

It’s a good idea, and saves time/effort to identify and only consider genes whose expression levels are detected in a certain number or proportion of cells within your assay.

dat<-detect_genes(dat)
cellCutoff<-20 # This number is arbitrary
expressed_genes <- row.names(subset(fData(dat),
    num_cells_expressed >= cellCutoff))

length(expressed_genes)
## [1] 15365

Once we’ve detected the number of cells expressing each gene, we can look at the distribution to get a better feel

hist(fData(dat)$num_cells_expressed,col="red",breaks=50,main="Number of cells expressing a given gene")

Most genes are not detectably expressed in more than one cell (this is the nature of single cell gene expression assays, and gene expression in general)

Lets log transform and look again. This time, we’ll add a threshold line showing our cutoff for ‘expressed genes’.

hist(log10(fData(dat)$num_cells_expressed),col="red",breaks=50,main="log10 Number of cells expressing a given gene")
abline(v=log10(cellCutoff),lty="dashed")

We have now identified a total of 15365 that are detectably expressed in at least 20 cells in our dataset.

Distribution of gene mean copies per cell

What is the average expression level (in mRNA Copies per cell) for each gene?

fData(dat)$mean_cpc<-Matrix::rowMeans(exprs(dat))
hist(log10(fData(dat[expressed_genes,])$mean_cpc),col="purple",breaks=50,main="Mean RNA copies per cell")

Cell QC metrics

Distribution of detected genes across cells

How many genes are expressed in a given cell?

hist(pData(dat)$num_genes_expressed,col="darkgreen",breaks=50,main="Number of genes expressed per cell")

Mt-genome proportion

A high proportion of mitochondrial genes may indicate a lower than ideal capture efficiency for a given cell. Here we identify the subset of mitochondrial genes and look at the proportion of reads mapping to ‘mt-’ genes vs genomic genes.

mito_genes<-fData(dat)$gene_id[grepl("^mt-",fData(dat)$gene_short_name)]

pData(dat)$mt_reads <- Matrix::colSums(exprs(dat)[mito_genes,])
pData(dat)$total_reads  <- Matrix::colSums(exprs(dat))
pData(dat)$mito_ratio <- pData(dat)$mt_reads/pData(dat)$total_reads

ggplot(as.data.frame(pData(dat)),
       aes(x = num_genes_expressed, y = mito_ratio)) + 
       geom_point() +
       labs(x = "Number of genes", y = "Mitochondrial ratio") +
       scale_color_brewer(palette = "Set1") +
       theme(legend.position = "none") + 
       ggtitle("Number of genes vs Mitochondrial ratio") + 
       monocle3:::monocle_theme_opts()

Total mRNAs per cell

To get a general picture of the capture efficiency and depth of information for each cell we can look at the total mRNA mass recovered per cell.

pData(dat)$Total_mRNA<-Matrix::colSums(exprs(dat))

hist(pData(dat)$Total_mRNA,col="darkblue",breaks=50,main="Total mRNAs sequenced per cell")

Bonus question: How many mRNAs do we expect are in a given eukaryotic cell?

For each of the above QC ‘criterion’ we can define thresholds that can be used to filter cells/genes to improve the quality of the dataset. Here is usually where obvious doublet cells (more than one cell is associated with a single barcode sequence) or low-quality cells are removed prior to doing any further statistical interpretations.

Reduce dimensionality and visualize the cells

Now we’re ready to visualize the cells. To do so, you can use either t-SNE, which is very popular in single-cell RNA-seq, or UMAP, which is increasingly common. Monocle 3 uses UMAP by default, since it is both faster and better suited for clustering and trajectory analysis in RNA-seq. To reduce the dimensionality of the data down into the X, Y plane so we can plot it easily, we call reduce_dimension():

dat <- reduce_dimension(dat,
                        verbose=TRUE,
                        reduction_method="UMAP",
                        cores = 4)
## No preprocess_method specified, using preprocess_method = 'PCA'
## Note: reduce_dimension will produce slightly different output each time you run it unless you set 'umap.fast_sgd = FALSE' and 'cores = 1'
## Running Uniform Manifold Approximation and Projection
## 23:29:54 UMAP embedding parameters a = 1.577 b = 0.8951
## Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
## Also defined by 'spam'
## 23:29:54 Read 10391 rows and found 40 numeric columns
## Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
## Also defined by 'spam'
## 23:29:55 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 23:29:56 Writing NN index file to temp file /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/file11b512a86e56d
## 23:29:56 Searching Annoy index using 4 threads, search_k = 1500
## 23:29:57 Annoy recall = 100%
## 23:29:57 Commencing smooth kNN distance calibration using 4 threads
## 23:29:59 Initializing from normalized Laplacian + noise
## 23:29:59 Commencing optimization for 200 epochs, with 210042 positive edges
## 23:30:04 Optimization finished

To visualize the dimensionality reduction we use plot_cells():

plot_cells(dat)
## No trajectory to plot. Has learn_graph() been called yet?
## cluster not found in colData(cds), cells will not be colored
## cluster_cells() has not been called yet, can't color cells by cluster

And we can look at how different features of the cells are distributed across this embedding. For now we only have technical features to view.

plot_cells(dat,color_cells_by="num_genes_expressed",cell_size=0.75)
## No trajectory to plot. Has learn_graph() been called yet?
## Cells aren't colored in a way that allows them to be grouped.

plot_cells(dat,color_cells_by="Total_mRNA",cell_size=0.75)
## No trajectory to plot. Has learn_graph() been called yet?
## Cells aren't colored in a way that allows them to be grouped.

plot_cells(dat,color_cells_by="mito_ratio",cell_size=0.75)
## No trajectory to plot. Has learn_graph() been called yet?
## Cells aren't colored in a way that allows them to be grouped.

What can we deduce/hypothesize about the embedding (we will formally test later)? Number of cell types? Diversity of cell types?

Embedding shapes

  • Sometimes, clearly defined cell types are obvious puncta in a 2D embedding
  • Other times, what you think of as a single cell type may be broken up into several ‘subtypes’
  • Still more, a single cell ‘type’ may consist of several ‘cell states’ that might present as more of an amorphous shape in an embedding
  • Cells in an ergodic transitioning state may be represented as a ‘pseeudotemporal trajectory’ as different cells pass through different phases of the transition. - These trajectories can be very useful as a high-resolution timecourse for how cells respond to changes or cues.
  • What types of shape:stories might be represented in these clusters? What cell types/states do we expect to find in an E18.5 developing mouse cortex?

We can also map the expression level of individual genes (markers) onto this embedding to start to parse out meaning.

plot_cells(dat,genes=c("Fezf2","Pax6","Satb2","Eomes","Tle4","Bcl11b"),cell_size=0.75)
## No trajectory to plot. Has learn_graph() been called yet?

Cluster similar cells into ‘cell types’

We next want to impose a clustering solution onto this embedding to group cells with similar transcriptional profiles together. We use cluster_cells() to perform this function in monocle3:

dat <- cluster_cells(dat, 
                     verbose=TRUE,
                     resolution=1e-5)
## Running leiden clustering algorithm ...
## Run kNN based graph clustering starts:
##   -Input data of 10391 rows and 2 columns
##   -k is set to 20
##   Finding nearest neighbors...
##     DONE. Run time: 0.179000000000002s
##   Compute jaccard coefficient between nearest-neighbor sets ...
##     DONE. Run time: 0.00399999999999068s
##   Build undirected graph from the weighted links ...
##     DONE. Run time: 0.092000000000013s
##   Run leiden clustering ...
## leiden_find_partition: partition_type       CPMVertexPartition
## leiden_find_partition: seed                 1134163607
## leiden_find_partition: resolution_parameter 1e-05
## leiden_find_partition: num_iter             2
## leiden_find_partition: initial_membership   0
## leiden_find_partition: edge_weights         0
## leiden_find_partition: node_sizes           0
## leiden_find_partition: number vertices      10391
## leiden_find_partition: number edges         207820
##     Current resolution is 1e-05; Modularity is 0.538322394942393; Quality is 415123.64588; Significance is 125480.429592668; Number of clusters is 9
##     Done. Run time: 0.345475912094116s
##   Clustering statistics
##    resolution_parameter  quality modularity significance cluster_count selected
##                   1e-05 415123.6  0.5383224     125480.4             9        *
## 
##   Cell counts by cluster
##    cluster cell_count cell_fraction
##          1       6730         0.648
##          2       1536         0.148
##          3       1446         0.139
##          4        246         0.024
##          5        134         0.013
##          6        105         0.010
##          7         77         0.007
##          8         73         0.007
##          9         44         0.004
## 
##   Maximal modularity is 0.538322394942393 for resolution parameter 1e-05
## 
##   Run kNN based graph clustering DONE.
##   -Number of clusters: 9
plot_cells(dat, color_cells_by="cluster", cell_size=0.75, group_label_size = 5, show_trajectory_graph=FALSE)

Remember, clustering is a useful tool but is also arbitrary for continuous cell-cell transitions.

De Novo cell type annotation

As a crude first-pass annotation, we can leverage publicly available, bulk RNA-Seq gene expression information for specific cell types to try and ‘learn’ cell type annotations for each cell. This ‘transfer learning’ approach can provide a resonable starting point for coarse cell type identification. The SingleR package leverages reference transcriptomic datasets of pure cell types to infer the cell of origin of each of the single cells independently. First we need to fetch a reference dataset of bulk RNA-Seq expression profiles.

mouse.rnaseq <- SingleR::MouseRNAseqData(ensembl = TRUE)
## Using temporary cache /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/BiocFileCache
## snapshotDate(): 2019-10-22
## Using temporary cache /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/BiocFileCache
## see ?SingleR and browseVignettes('SingleR') for documentation
## Using temporary cache /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/BiocFileCache
## Using temporary cache /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/BiocFileCache
## downloading 1 resources
## retrieving 1 resource
## Using temporary cache /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/BiocFileCache
## loading from cache
## Using temporary cache /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/BiocFileCache
## Using temporary cache /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/BiocFileCache
## Using temporary cache /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/BiocFileCache
## see ?SingleR and browseVignettes('SingleR') for documentation
## Using temporary cache /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/BiocFileCache
## Using temporary cache /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/BiocFileCache
## downloading 1 resources
## retrieving 1 resource
## Using temporary cache /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/BiocFileCache
## loading from cache
## Using temporary cache /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/BiocFileCache
## Using temporary cache /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/BiocFileCache
## Using temporary cache /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/BiocFileCache
## snapshotDate(): 2019-10-29
## loading from cache
## Warning: Unable to map 2180 of 21214 requested IDs.

And then, we can use this dataset to compare our single cell expression profiles against to apply coarse cell type labels.

dat.exprs<-exprs(dat[expressed_genes,])
rownames(dat.exprs) <- str_remove(rownames(dat.exprs), "\\.\\d+")

system.time(annots<-SingleR(dat.exprs,
                ref = mouse.rnaseq, labels = colData(mouse.rnaseq)$label.fine,
                de.method = "classic", method = "single", BPPARAM = MulticoreParam(4))
)

Apply learned annotations to cell_data_set object

Once we’ve learned labels for each cell, we can then add this information into the phenotypeData (pData()) slot of our main object.

pData(dat)$cell_type <- annots$pruned.labels
plot_cells(dat,color_cells_by="cell_type", group_cells_by = "cluster", cell_size=0.75, label_cell_groups = FALSE)
## No trajectory to plot. Has learn_graph() been called yet?

table(pData(dat)$cell_type)
## 
##                 aNSCs            Astrocytes               B cells 
##                   195                     2                     2 
##     Endothelial cells             Ependymal          Erythrocytes 
##                    84                     2                   140 
##           Fibroblasts Fibroblasts activated Fibroblasts senescent 
##                    16                    27                    10 
##          Granulocytes           Macrophages Macrophages activated 
##                     1                     9                     3 
##             Microglia             Monocytes               Neurons 
##                    19                     4                  1364 
##                  NPCs      Oligodendrocytes                  OPCs 
##                  8207                     1                   200 
##                 qNSCs 
##                    13

The algorithm does a reasonable job at identifying major cell types, and identifies a number of different cell types in our dataset. Before we go any further, lets filter the dataset down to only cell types that we might be interested in for downstream analysis.

inds <- annots$pruned.labels %in% c("NPCs", "Neurons", "OPCs", "Oligodendrocytes",
                                    "qNSCs", "aNSCs", "Astrocytes", "Ependymal")
# Only keep these cell types
cells_use <- row.names(annots)[inds]
dat <- dat[,cells_use]
table(pData(dat)$cell_type)
## 
##            aNSCs       Astrocytes        Ependymal          Neurons 
##              195                2                2             1364 
##             NPCs Oligodendrocytes             OPCs            qNSCs 
##             8207                1              200               13
# Preprocess again on filtered dataset
nDims <- 20

dat <- preprocess_cds(dat,
                      num_dim = nDims,
                      method = "PCA",
                      verbose=TRUE)
## Remove noise by PCA ...
dat <- reduce_dimension(dat,
                        verbose=TRUE,
                        reduction_method="UMAP",
                        cores = 4)
## No preprocess_method specified, using preprocess_method = 'PCA'
## Note: reduce_dimension will produce slightly different output each time you run it unless you set 'umap.fast_sgd = FALSE' and 'cores = 1'
## Running Uniform Manifold Approximation and Projection
## 23:30:36 UMAP embedding parameters a = 1.577 b = 0.8951
## Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
## Also defined by 'spam'
## 23:30:36 Read 9984 rows and found 20 numeric columns
## Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
## Also defined by 'spam'
## 23:30:36 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 23:30:37 Writing NN index file to temp file /var/folders/rn/f0sgm51j3wqf9p3my_z549c80000gn/T//RtmpuXNzni/file11b511c83d6bb
## 23:30:37 Searching Annoy index using 4 threads, search_k = 1500
## 23:30:38 Annoy recall = 100%
## 23:30:39 Commencing smooth kNN distance calibration using 4 threads
## 23:30:40 Initializing from normalized Laplacian + noise
## 23:30:40 Commencing optimization for 500 epochs, with 197716 positive edges
## 23:30:50 Optimization finished
dat <- cluster_cells(dat,
                     verbose=TRUE,
                     resolution=1e-5)
## Running leiden clustering algorithm ...
## Run kNN based graph clustering starts:
##   -Input data of 9984 rows and 2 columns
##   -k is set to 20
##   Finding nearest neighbors...
##     DONE. Run time: 0.02800000000002s
##   Compute jaccard coefficient between nearest-neighbor sets ...
##     DONE. Run time: 0.00100000000000477s
##   Build undirected graph from the weighted links ...
##     DONE. Run time: 0.0340000000000202s
##   Run leiden clustering ...
## leiden_find_partition: partition_type       CPMVertexPartition
## leiden_find_partition: seed                 1690597741
## leiden_find_partition: resolution_parameter 1e-05
## leiden_find_partition: num_iter             2
## leiden_find_partition: initial_membership   0
## leiden_find_partition: edge_weights         0
## leiden_find_partition: node_sizes           0
## leiden_find_partition: number vertices      9984
## leiden_find_partition: number edges         199680
##     Current resolution is 1e-05; Modularity is 0.480268645377476; Quality is 398790.13884; Significance is 79807.8125645107; Number of clusters is 4
##     Done. Run time: 0.274734020233154s
## 
##   Clustering statistics
##    resolution_parameter  quality modularity significance cluster_count selected
##                   1e-05 398790.1  0.4802686     79807.81             4        *
## 
##   Cell counts by cluster
##    cluster cell_count cell_fraction
##          1       6875         0.689
##          2       1563         0.157
##          3       1441         0.144
##          4        105         0.011
## 
##   Maximal modularity is 0.480268645377476 for resolution parameter 1e-05
## 
##   Run kNN based graph clustering DONE.
##   -Number of clusters: 4
plot_cells(dat,color_cells_by="cell_type", group_cells_by = "cluster", cell_size=0.75, label_cell_groups = FALSE) + scale_color_brewer(palette="Set1")
## No trajectory to plot. Has learn_graph() been called yet?

Finding marker genes

One of our objectives is to find marker genes expressed by each cluster/cell type. Once cells have been clustered, we can ask what genes makes them different from one another. To do that, start by calling the top_markers() function:

marker_test_res <- top_markers(dat, group_cells_by="cell_type", 
                               reference_cells=500, cores=4)
top_specific_markers <- marker_test_res %>%
                           dplyr::filter(fraction_expressing >= 0.30) %>%
                            group_by(cell_group) %>%
                            top_n(1, pseudo_R2)

top_specific_marker_ids <- unique(top_specific_markers %>% pull(gene_id))

plot_genes_by_group(dat,
                    top_specific_marker_ids,
                    group_cells_by="cell_type",
                    ordering_type="maximal_on_diag",
                    max.size=4)
## Warning in if (axis_order == "marker_group") {: the condition has length > 1 and
## only the first element will be used

Pseudotime analysis

Pseudotime is a measure of how much progress an individual cell has made through a process such as cell differentiation. Since we have at least one population of cells that is transitioning from a progenitor state to a series of mature neurons, can we identify potential differentiation trajectories?

Learn trajectory graph

First we must learn a ‘trajectory graph’ across each partition (contiguous group of cells) in the data.

dat <- learn_graph(dat)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
plot_cells(dat,
           color_cells_by = "cell_type",
           label_groups_by_cluster=FALSE,
           label_leaves=TRUE,
           label_branch_points=TRUE,
           label_cell_groups = FALSE,
           graph_label_size=3,
           cell_size=0.75) + scale_color_brewer(palette="Set1")

dat <- order_cells(dat, root_pr_nodes="Y_3")

plot_cells(dat,
           color_cells_by = "pseudotime",
           label_cell_groups=FALSE,
           label_leaves=FALSE,
           label_branch_points=FALSE,
           graph_label_size=3,
           cell_size=0.75)

Here we are also setting the ‘root’ of this trajectory to begin at a node in the NPC cells.

Differential gene expression with respect to pseudotime

How do we find the genes that are differentially expressed on the different paths through the trajectory? How do we find the ones that are restricted to the beginning of the trajectory? Or excluded from it? Monocle3 uses a ‘principal graph test’ to test whether cells at similar positions on the trajectory have correlated expression

system.time(pseudotime_pr_test_res<-graph_test(dat, 
                                               neighbor_graph="principal_graph",
                                               cores=4)
)
pr_deg_ids <- row.names(subset(pseudotime_pr_test_res, q_value < 1e-50))

Here are some high-scoring differentially expressed genes along the pseudotime trajectory in their UMAP embedding

pseudotime_genes <- c("Pax6","Btg2","Snap25","Fezf2","Hes6","Cux1")
plot_cells(dat, genes=pseudotime_genes,
           show_trajectory_graph=FALSE,
           label_cell_groups=FALSE,
           label_leaves=FALSE,
           cell_size=0.25)

We can also explicitly look at how the expression of these genes along the pseudotime trajectory. The function plot_genes_in_pseudotime() takes a small set of genes and shows you their dynamics as a function of pseudotime:

pseudotime_lineage_cds <- dat[fData(dat)$gene_short_name %in% pseudotime_genes,
                       pData(dat)$cell_type %in% c("NPCs","Neurons")]
plot_genes_in_pseudotime(pseudotime_lineage_cds,
                         color_cells_by="cell_type",
                         min_expr=0.5)

Bonus: 3D trajectories

dat_3d <- reduce_dimension(dat, max_components = 3, cores=4)
## No preprocess_method specified, using preprocess_method = 'PCA'
## Note: reduce_dimension will produce slightly different output each time you run it unless you set 'umap.fast_sgd = FALSE' and 'cores = 1'
## Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
## Also defined by 'spam'
## Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
## Also defined by 'spam'
dat_3d <- cluster_cells(dat_3d,resolution = 1e-5)
dat_3d <- learn_graph(dat_3d)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
#dat_3d <- order_cells(dat_3d)

plot_cells_3d(dat_3d, color_cells_by="cell_type")

Pattern Discovery (NMF)

Finally, we can step beyond marker gene analysis, and use some ‘latent space’ discovery methods to learn patterns of co-regulated gene expression.

These patterns can be used to identify/define: * Cell type identities * Biological processes * Spatial gradients * Other cellular features

… often with significantly greater resolution and precision than marker genes.

Importantly, these patterns are ‘data driven’ and often yield insights into the heterogeneity and complexity of a given dataset that were unanticipated.

nPatterns<-30
# Takes about ~3-4 minutes...
system.time(dat.nnmf<-NNLM::nnmf(as.matrix(log10(exprs(dat)[expressed_genes,]+1)), k = nPatterns,
    n.threads = 6, trace = 10, verbose = TRUE, show.warning = TRUE))
##    user  system elapsed 
## 725.875  30.411 199.605
# Gene x Pattern matrix
dim(dat.nnmf$W)
## [1] 15365    30
# Pattern x Cell matrix
dim(dat.nnmf$H)
## [1]   30 9984
#Add patterns to phenotype data for visualization
patterns.df<-data.frame(t(dat.nnmf$H))
colnames(patterns.df)<-paste0("Pattern_",c(1:nPatterns))

pData(dat)<-cbind(pData(dat),patterns.df)

pdf("figures/patterns.pdf",width=10,height=10)
lapply(c(1:nPatterns),function(i){
plot_cells(dat, color_cells_by=paste0("Pattern_",i), cell_size=0.75) + 
    ggtitle(paste0("Pattern_",i))  + 
    coord_equal(1)
})
## [[1]]
## 
## [[2]]
## 
## [[3]]
## 
## [[4]]
## 
## [[5]]
## 
## [[6]]
## 
## [[7]]
## 
## [[8]]
## 
## [[9]]
## 
## [[10]]
## 
## [[11]]
## 
## [[12]]
## 
## [[13]]
## 
## [[14]]
## 
## [[15]]
## 
## [[16]]
## 
## [[17]]
## 
## [[18]]
## 
## [[19]]
## 
## [[20]]
## 
## [[21]]
## 
## [[22]]
## 
## [[23]]
## 
## [[24]]
## 
## [[25]]
## 
## [[26]]
## 
## [[27]]
## 
## [[28]]
## 
## [[29]]
## 
## [[30]]
dev.off()
## quartz_off_screen 
##                 2
targetPattern<-24
plot_cells(dat, color_cells_by=paste0("Pattern_",targetPattern), cell_size=0.75) + 
    ggtitle(paste0("Pattern ",targetPattern))  + 
    coord_equal(1)

Genes associated with learned patterns

geneWeights.df<-data.frame(dat.nnmf$W)
colnames(geneWeights.df)<-paste0("Pattern_",c(1:nPatterns))

#fData(dat)[head(rownames(dat.nnmf$W)[order(dat.nnmf$W[,patternOfInterest],decreasing=TRUE)]),]
tmp<-as.data.frame(merge(fData(dat)[,c("gene_id","gene_short_name")],geneWeights.df,by=0))

targetPattern<-c(5,8,16,23)
DT::datatable(tmp[,c("gene_id","gene_short_name",
                     unlist(lapply(targetPattern,function(i){paste0("Pattern_",i)
                       }))
                       )])
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

Velocity analysis

Finally, we can revist the RNA velocity analysis to make predictions about how cells are traveling through the UMAP embedding. By estimating the potential turnover rates for individual genes, we can then make predictions about where each gene:cell is in this process. By aggregating this information for a given cell, and projecting into the UMAP embedding, we can make an estimate of ‘where’ each cell will be in this embedding in about an hour. This helps us understand the flow of cells through these manifold embeddings.

cell.dist <- as.dist(1-armaCor(t(dat@reducedDims$PCA))) ### TODO: point to cell_data_set PCA projection to calculate cell-cell distance

fit.quantile <- 0.02
system.time(rvel.cd <- gene.relative.velocity.estimates(sf[,cells_use],uf[,cells_use],deltaT=1,kCells=25,fit.quantile=fit.quantile, cell.dist=cell.dist, n.cores=7)) ### TODO: Add cell.dist to velocity estimates
emb <- dat@reducedDims$UMAP
nCols<-length(unique(pData(dat)$cell_type))
cell.colors<-rainbow(nCols)[as.factor(pData(dat)$cell_type)]
names(cell.colors)<-colnames(sf[,cells_use])

system.time(velocity_embedding<-show.velocity.on.embedding.cor(emb,
                               rvel.cd,
                               n=200,
                               scale='sqrt',
                               cell.colors=ac(cell.colors,alpha=0.5),
                               cex=0.8,
                               arrow.scale=3,
                               show.grid.flow=TRUE,
                               min.grid.cell.mass=0.5,
                               grid.n=40,
                               arrow.lwd=1
                               ,do.par=F,
                               cell.border.alpha = 0.1,
                               n.cores=6,
                               return.details=TRUE)
)

Gene-specific velocity plots

gene <- "ENSMUSG00000027210.20" ## Meis2
gene.relative.velocity.estimates(sf[,cells_use],uf[,cells_use],deltaT=1,kCells = 25,kGenes=1,fit.quantile=fit.quantile,cell.emb=emb,cell.colors=cell.colors,cell.dist=cell.dist,show.gene=gene,old.fit=rvel.cd,do.par=T)

gene <- "ENSMUSG00000027168.21" ## Pax6
gene.relative.velocity.estimates(sf[,cells_use],uf[,cells_use],deltaT=1,kCells = 25,kGenes=1,fit.quantile=fit.quantile,cell.emb=emb,cell.colors=ac(cell.colors,alpha=0.5),show.gene=gene,old.fit=rvel.cd,do.par=T)

gene <- "ENSMUSG00000021743.6"
gene.relative.velocity.estimates(sf[,cells_use],uf[,cells_use],deltaT=1,kCells = 100,kGenes=1,fit.quantile=fit.quantile,cell.emb=emb,cell.colors=ac(cell.colors,alpha=0.5),show.gene=gene,old.fit=rvel.cd,do.par=T)

Session Information

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ensembldb_2.10.2                   AnnotationFilter_1.10.0           
##  [3] GenomicFeatures_1.38.2             AnnotationDbi_1.48.0              
##  [5] monocle3_0.2.0                     NNLM_0.4.4                        
##  [7] SingleR_1.0.5                      DropletUtils_1.6.1                
##  [9] SingleCellExperiment_1.8.0         SummarizedExperiment_1.16.1       
## [11] DelayedArray_0.12.2                BiocParallel_1.20.1               
## [13] matrixStats_0.55.0                 Biobase_2.46.0                    
## [15] zeallot_0.1.0                      scales_1.1.0                      
## [17] BUSpaRse_1.1.1                     velocyto.R_0.6                    
## [19] Matrix_1.2-18                      forcats_0.4.0                     
## [21] stringr_1.4.0                      dplyr_0.8.4                       
## [23] purrr_0.3.3                        readr_1.3.1                       
## [25] tidyr_1.0.2                        tibble_2.1.3                      
## [27] ggplot2_3.2.1                      tidyverse_1.3.0                   
## [29] BSgenome.Mmusculus.UCSC.mm10_1.4.0 BSgenome_1.54.0                   
## [31] rtracklayer_1.46.0                 Biostrings_2.54.0                 
## [33] XVector_0.26.0                     GenomicRanges_1.38.0              
## [35] GenomeInfoDb_1.22.0                IRanges_2.20.2                    
## [37] S4Vectors_0.24.3                   biomaRt_2.42.0                    
## [39] AnnotationHub_2.18.0               BiocFileCache_1.10.2              
## [41] dbplyr_1.4.2                       BiocGenerics_0.32.0               
## 
## loaded via a namespace (and not attached):
##   [1] rappdirs_0.3.1                R.methodsS3_1.8.0            
##   [3] coda_0.19-3                   acepack_1.4.1                
##   [5] bit64_0.9-7                   knitr_1.28                   
##   [7] irlba_2.3.3                   R.utils_2.9.2                
##   [9] data.table_1.12.8             rpart_4.1-15                 
##  [11] RCurl_1.98-1.1                generics_0.0.2               
##  [13] RhpcBLASctl_0.20-17           leidenbase_0.1.0             
##  [15] RSQLite_2.2.0                 RANN_2.6.1                   
##  [17] proxy_0.4-23                  bit_1.1-15.2                 
##  [19] xml2_1.2.2                    lubridate_1.7.4              
##  [21] httpuv_1.5.2                  assertthat_0.2.1             
##  [23] viridis_0.5.1                 xfun_0.12                    
##  [25] hms_0.5.3                     evaluate_0.14                
##  [27] promises_1.1.0                fansi_0.4.1                  
##  [29] progress_1.2.2                caTools_1.18.0               
##  [31] readxl_1.3.1                  igraph_1.2.4.2               
##  [33] DBI_1.1.0                     htmlwidgets_1.5.1            
##  [35] spdep_1.1-3                   ellipsis_0.3.0               
##  [37] crosstalk_1.0.0               RSpectra_0.16-0              
##  [39] backports_1.1.5               deldir_0.1-25                
##  [41] RcppParallel_4.4.4            vctrs_0.2.3                  
##  [43] ROCR_1.0-7                    withr_2.1.2                  
##  [45] batchelor_1.2.4               grr_0.9.5                    
##  [47] checkmate_2.0.0               GenomicAlignments_1.22.1     
##  [49] prettyunits_1.1.1             cluster_2.1.0                
##  [51] ExperimentHub_1.12.0          dotCall64_1.0-0              
##  [53] lazyeval_0.2.2                crayon_1.3.4                 
##  [55] units_0.6-5                   slam_0.1-47                  
##  [57] edgeR_3.28.0                  pkgconfig_2.0.3              
##  [59] labeling_0.3                  nlme_3.1-144                 
##  [61] vipor_0.4.5                   ProtGenerics_1.18.0          
##  [63] nnet_7.3-13                   rlang_0.4.4                  
##  [65] lifecycle_0.1.0               modelr_0.1.6                 
##  [67] rsvd_1.0.3                    cellranger_1.1.0             
##  [69] raster_3.0-12                 boot_1.3-24                  
##  [71] Matrix.utils_0.9.7            Rhdf5lib_1.8.0               
##  [73] reprex_0.3.0                  base64enc_0.1-3              
##  [75] beeswarm_0.2.3                png_0.1-7                    
##  [77] viridisLite_0.3.0             bitops_1.0-6                 
##  [79] R.oo_1.23.0                   KernSmooth_2.23-16           
##  [81] spam_2.5-1                    speedglm_0.3-2               
##  [83] blob_1.2.1                    DelayedMatrixStats_1.8.0     
##  [85] classInt_0.4-2                jpeg_0.1-8.1                 
##  [87] memoise_1.1.0                 magrittr_1.5                 
##  [89] plyr_1.8.5                    gplots_3.0.3                 
##  [91] gdata_2.18.0                  zlibbioc_1.32.0              
##  [93] m3addon_0.1                   compiler_3.6.2               
##  [95] dqrng_0.2.1                   RColorBrewer_1.1-2           
##  [97] pcaMethods_1.78.0             Rsamtools_2.2.3              
##  [99] cli_2.0.1                     LearnBayes_2.15.1            
## [101] htmlTable_1.13.3              Formula_1.2-3                
## [103] MASS_7.3-51.5                 mgcv_1.8-31                  
## [105] tidyselect_1.0.0              stringi_1.4.6                
## [107] densityClust_0.3              yaml_2.2.1                   
## [109] BiocSingular_1.2.2            askpass_1.1                  
## [111] h5_0.9.9                      locfit_1.5-9.1               
## [113] latticeExtra_0.6-29           ggrepel_0.8.1                
## [115] pbmcapply_1.5.0               grid_3.6.2                   
## [117] fastmatch_1.1-0               tools_3.6.2                  
## [119] modes_0.7.0                   rstudioapi_0.11              
## [121] foreign_0.8-75                gridExtra_2.3                
## [123] plyranges_1.6.10              farver_2.0.3                 
## [125] Rtsne_0.15                    digest_0.6.25                
## [127] BiocManager_1.30.10           FNN_1.1.3                    
## [129] shiny_1.4.0                   Rcpp_1.0.3                   
## [131] broom_0.5.4                   BiocVersion_3.10.1           
## [133] later_1.0.0                   RcppAnnoy_0.0.14             
## [135] httr_1.4.1                    sf_0.8-1                     
## [137] colorspace_1.4-1              rvest_0.3.5                  
## [139] XML_3.99-0.3                  fs_1.3.1                     
## [141] reticulate_1.14               splines_3.6.2                
## [143] fields_10.3                   uwot_0.1.5                   
## [145] expm_0.999-4                  scater_1.14.6                
## [147] sp_1.4-0                      plotly_4.9.2                 
## [149] spData_0.3.3                  xtable_1.8-4                 
## [151] jsonlite_1.6.1                R6_2.4.1                     
## [153] gmodels_2.18.1                Hmisc_4.3-1                  
## [155] pillar_1.4.3                  htmltools_0.4.0              
## [157] mime_0.9                      DT_0.12                      
## [159] glue_1.3.1                    fastmap_1.0.1                
## [161] BiocNeighbors_1.4.1           class_7.3-15                 
## [163] codetools_0.2-16              interactiveDisplayBase_1.24.0
## [165] maps_3.3.0                    fgsea_1.12.0                 
## [167] lattice_0.20-40               curl_4.3                     
## [169] ggbeeswarm_0.6.0              gtools_3.8.1                 
## [171] openssl_1.4.1                 survival_3.1-8               
## [173] limma_3.42.2                  rmarkdown_2.1                
## [175] munsell_0.5.0                 e1071_1.7-3                  
## [177] rhdf5_2.30.1                  GenomeInfoDbData_1.2.2       
## [179] HDF5Array_1.14.2              haven_2.2.0                  
## [181] reshape2_1.4.3                gtable_0.3.0

Knit Time

With caches

knitEnd <- Sys.time()
knitEnd - knitStart
## Time difference of 20.63913 mins